Pesquisa | Portal Regional da BVS

1.

Accelerating 4D image reconstruction for magnetic resonance-guided radiotherapy.

Lecoeur, Bastien; Barbone, Marco; Gough, Jessica; Oelfke, Uwe; Luk, Wayne; Gaydadjiev, Georgi; Wetscherek, Andreas.

Phys Imaging Radiat Oncol ; 27: 100484, 2023 Jul.

Artigo em Inglês | MEDLINE | ID: mdl-37664799

RESUMO

Background and purpose: Physiological motion impacts the dose delivered to tumours and vital organs in external beam radiotherapy and particularly in particle therapy. The excellent soft-tissue demarcation of 4D magnetic resonance imaging (4D-MRI) could inform on intra-fractional motion, but long image reconstruction times hinder its use in online treatment adaptation. Here we employ techniques from high-performance computing to reduce 4D-MRI reconstruction times below two minutes to facilitate their use in MR-guided radiotherapy. Material and methods: Four patients with pancreatic adenocarcinoma were scanned with a radial stack-of-stars gradient echo sequence on a 1.5T MR-Linac. Fast parallelised open-source implementations of the extra-dimensional golden-angle radial sparse parallel algorithm were developed for central processing unit (CPU) and graphics processing unit (GPU) architectures. We assessed the impact of architecture, oversampling and respiratory binning strategy on 4D-MRI reconstruction time and compared images using the structural similarity (SSIM) index against a MATLAB reference implementation. Scaling and bottlenecks for the different architectures were studied using multi-GPU systems. Results: All reconstructed 4D-MRI were identical to the reference implementation (SSIM > 0.99). Images reconstructed with overlapping respiratory bins were sharper at the cost of longer reconstruction times. The CPU + GPU implementation was over 17 times faster than the reference implementation, reconstructing images in 60 ± 1 s and hyper-scaled using multiple GPUs. Conclusion: Respiratory-resolved 4D-MRI reconstruction times can be reduced using high-performance computing methods for online workflows in MR-guided radiotherapy with potential applications in particle therapy.

2.

Distributed large-scale graph processing on FPGAs.

Sahebi, Amin; Barbone, Marco; Procaccini, Marco; Luk, Wayne; Gaydadjiev, Georgi; Giorgi, Roberto.

J Big Data ; 10(1): 95, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37283690

RESUMO

Processing large-scale graphs is challenging due to the nature of the computation that causes irregular memory access patterns. Managing such irregular accesses may cause significant performance degradation on both CPUs and GPUs. Thus, recent research trends propose graph processing acceleration with Field-Programmable Gate Arrays (FPGA). FPGAs are programmable hardware devices that can be fully customised to perform specific tasks in a highly parallel and efficient manner. However, FPGAs have a limited amount of on-chip memory that cannot fit the entire graph. Due to the limited device memory size, data needs to be repeatedly transferred to and from the FPGA on-chip memory, which makes data transfer time dominate over the computation time. A possible way to overcome the FPGA accelerators' resource limitation is to engage a multi-FPGA distributed architecture and use an efficient partitioning scheme. Such a scheme aims to increase data locality and minimise communication between different partitions. This work proposes an FPGA processing engine that overlaps, hides and customises all data transfers so that the FPGA accelerator is fully utilised. This engine is integrated into a framework for using FPGA clusters and is able to use an offline partitioning method to facilitate the distribution of large-scale graphs. The proposed framework uses Hadoop at a higher level to map a graph to the underlying hardware platform. The higher layer of computation is responsible for gathering the blocks of data that have been pre-processed and stored on the host's file system and distribute to a lower layer of computation made of FPGAs. We show how graph partitioning combined with an FPGA architecture will lead to high performance, even when the graph has Millions of vertices and Billions of edges. In the case of the PageRank algorithm, widely used for ranking the importance of nodes in a graph, compared to state-of-the-art CPU and GPU solutions, our implementation is the fastest, achieving a speedup of 13 compared to 8 and 3 respectively. Moreover, in the case of the large-scale graphs, the GPU solution fails due to memory limitations while the CPU solution achieves a speedup of 12 compared to the 26x achieved by our FPGA solution. Other state-of-the-art FPGA solutions are 28 times slower than our proposed solution. When the size of a graph limits the performance of a single FPGA device, our performance model shows that using multi-FPGAs in a distributed system can further improve the performance by about 12x. This highlights our implementation efficiency for large datasets not fitting in the on-chip memory of a hardware device.

3.

High-Performance Acceleration of 2-D and 3-D CNNs on FPGAs Using Static Block Floating Point.

Fan, Hongxiang; Liu, Shuanglong; Que, Zhiqiang; Niu, Xinyu; Luk, Wayne.

IEEE Trans Neural Netw Learn Syst ; 34(8): 4473-4487, 2023 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-34644253

RESUMO

Over the past few years, 2-D convolutional neural networks (CNNs) have demonstrated their great success in a wide range of 2-D computer vision applications, such as image classification and object detection. At the same time, 3-D CNNs, as a variant of 2-D CNNs, have shown their excellent ability to analyze 3-D data, such as video and geometric data. However, the heavy algorithmic complexity of 2-D and 3-D CNNs imposes a substantial overhead over the speed of these networks, which limits their deployment in real-life applications. Although various domain-specific accelerators have been proposed to address this challenge, most of them only focus on accelerating 2-D CNNs, without considering their computational efficiency on 3-D CNNs. In this article, we propose a unified hardware architecture to accelerate both 2-D and 3-D CNNs with high hardware efficiency. Our experiments demonstrate that the proposed accelerator can achieve up to 92.4% and 85.2% multiply-accumulate efficiency on 2-D and 3-D CNNs, respectively. To improve the hardware performance, we propose a hardware-friendly quantization approach called static block floating point (BFP), which eliminates the frequent representation conversions required in traditional dynamic BFP arithmetic. Comparing with the integer linear quantization using zero-point, the static BFP quantization can decrease the logic resource consumption of the convolutional kernel design by nearly 50% on a field-programmable gate array (FPGA). Without time-consuming retraining, the proposed static BFP quantization is able to quantize the precision to 8-bit mantissa with negligible accuracy loss. As different CNNs on our reconfigurable system require different hardware and software parameters to achieve optimal hardware performance and accuracy, we also propose an automatic tool for parameter optimization. Based on our hardware design and optimization, we demonstrate that the proposed accelerator can achieve 3.8-5.6 times higher energy efficiency than graphics processing unit (GPU) implementation. Comparing with the state-of-the-art FPGA-based accelerators, our design achieves higher generality and up to 1.4-2.2 times higher resource efficiency on both 2-D and 3-D CNNs.

4.

Design of Fully Spectral CNNs for Efficient FPGA-Based Acceleration.

Liu, Shuanglong; Fan, Hongxiang; Luk, Wayne.

IEEE Trans Neural Netw Learn Syst ; PP2022 Dec 02.

Artigo em Inglês | MEDLINE | ID: mdl-36459611

RESUMO

Computing convolutional layers in the frequency domain using fast Fourier transformation (FFT) has been demonstrated to be effective in reducing the computational complexity of convolutional neural networks (CNNs). Nevertheless, the main challenge of this approach lies in the frequent and repeated transformations between the spatial and frequency domains due to the absence of nonlinear functions in the spectral domain, as such it makes the benefit less attractive for low-latency inference, especially on embedded platforms. To overcome the drawbacks in the existing FFT-based convolution, we propose a fully spectral CNN using a novel spectral-domain adaptive rectified linear unit (ReLU) layer, which completely removes the compute-intensive transformations between the spatial and frequency domains within the network. The proposed fully spectral CNNs maintain the nonlinearity of the spatial CNNs while taking into account the hardware efficiency. We then propose a deeply customized and compute-efficient hardware architecture to accelerate the fully spectral CNN inference on field programmable gate array (FPGA). Different hardware optimizations, such as spectral-domain intralayer and interlayer pipeline techniques, are introduced to further improve the performance of throughput. To achieve a load-balanced pipeline, a design space exploration (DSE) framework is proposed to optimize the resource allocation between hardware modules according to the resource constraints. On an Intel's Arria 10 SX160 FPGA, our optimized accelerator achieves a throughput of 204 Gop/s with 80% of compute efficiency. Compared with the state-of-the-art spatial and FFT-based implementations on the same device, our accelerator is 4 × â¼ 6.6 × and 3.0TEXPRESERVE3 â¼ 4.4 × faster while maintaining a similar level of accuracy across different benchmark datasets.

5.

Toward Full-Stack Acceleration of Deep Convolutional Neural Networks on FPGAs.

Liu, Shuanglong; Fan, Hongxiang; Ferianc, Martin; Niu, Xinyu; Shi, Huifeng; Luk, Wayne.

IEEE Trans Neural Netw Learn Syst ; 33(8): 3974-3987, 2022 08.

Artigo em Inglês | MEDLINE | ID: mdl-33577458

RESUMO

Due to the huge success and rapid development of convolutional neural networks (CNNs), there is a growing demand for hardware accelerators that accommodate a variety of CNNs to improve their inference latency and energy efficiency, in order to enable their deployment in real-time applications. Among popular platforms, field-programmable gate arrays (FPGAs) have been widely adopted for CNN acceleration because of their capability to provide superior energy efficiency and low-latency processing, while supporting high reconfigurability, making them favorable for accelerating rapidly evolving CNN algorithms. This article introduces a highly customized streaming hardware architecture that focuses on improving the compute efficiency for streaming applications by providing full-stack acceleration of CNNs on FPGAs. The proposed accelerator maps most computational functions, that is, convolutional and deconvolutional layers into a singular unified module, and implements the residual and concatenative connections between the functions with high efficiency, to support the inference of mainstream CNNs with different topologies. This architecture is further optimized through exploiting different levels of parallelism, layer fusion, and fully leveraging digital signal processing blocks (DSPs). The proposed accelerator has been implemented on Intel's Arria 10 GX1150 hardware and evaluated with a wide range of benchmark models. The results demonstrate a high performance of over 1.3 TOP/s of throughput, up to 97% of compute [multiply-accumulate (MAC)] efficiency, which outperforms the state-of-the-art FPGA accelerators.

Assuntos

Redes Neurais de Computação , Processamento de Sinais Assistido por Computador , Aceleração , Algoritmos , Computadores

6.

Performance-aware programming for intraoperative intensity-based image registration on graphics processing units.

Leong, Martin C W; Lee, Kit-Hang; Kwan, Bowen P Y; Ng, Yui-Lun; Liu, Zhiyu; Navab, Nassir; Luk, Wayne; Kwok, Ka-Wai.

Int J Comput Assist Radiol Surg ; 16(3): 375-386, 2021 Mar.

Artigo em Inglês | MEDLINE | ID: mdl-33484431

RESUMO

PURPOSE: Intensity-based image registration has been proven essential in many applications accredited to its unparalleled ability to resolve image misalignments. However, long registration time for image realignment prohibits its use in intra-operative navigation systems. There has been much work on accelerating the registration process by improving the algorithm's robustness, but the innate computation required by the registration algorithm has been unresolved. METHODS: Intensity-based registration methods involve operations with high arithmetic load and memory access demand, which supposes to be reduced by graphics processing units (GPUs). Although GPUs are widespread and affordable, there is a lack of open-source GPU implementations optimized for non-rigid image registration. This paper demonstrates performance-aware programming techniques, which involves systematic exploitation of GPU features, by implementing the diffeomorphic log-demons algorithm. RESULTS: By resolving the pinpointed computation bottlenecks on GPU, our implementation of diffeomorphic log-demons on Nvidia GTX Titan X GPU has achieved ~ 95 times speed-up compared to the CPU and registered a 1.3-M voxel image in 286 ms. Even for large 37-M voxel images, our implementation is able to register in 8.56 s, which attained ~ 258 times speed-up. Our solution involves effective employment of GPU computation units, memory, and data bandwidth to resolve computation bottlenecks. CONCLUSION: The computation bottlenecks in diffeomorphic log-demons are pinpointed, analyzed, and resolved using various GPU performance-aware programming techniques. The proposed fast computation on basic image operations not only enhances the computation of diffeomorphic log-demons, but is also potentially extended to speed up many other intensity-based approaches. Our implementation is open-source on GitHub at https://bit.ly/2PYZxQz .

Assuntos

Gráficos por Computador , Processamento de Imagem Assistida por Computador/métodos , Monitorização Intraoperatória/instrumentação , Algoritmos , Humanos , Monitorização Intraoperatória/métodos , Distribuição Normal , Linguagens de Programação , Reprodutibilidade dos Testes , Software

7.

GeDi: applying suffix arrays to increase the repertoire of detectable SNVs in tumour genomes.

Coleman, Izaak; Corleone, Giacomo; Arram, James; Ng, Ho-Cheung; Magnani, Luca; Luk, Wayne.

BMC Bioinformatics ; 21(1): 45, 2020 Feb 05.

Artigo em Inglês | MEDLINE | ID: mdl-32024475

RESUMO

BACKGROUND: Current popular variant calling pipelines rely on the mapping coordinates of each input read to a reference genome in order to detect variants. Since reads deriving from variant loci that diverge in sequence substantially from the reference are often assigned incorrect mapping coordinates, variant calling pipelines that rely on mapping coordinates can exhibit reduced sensitivity. RESULTS: In this work we present GeDi, a suffix array-based somatic single nucleotide variant (SNV) calling algorithm that does not rely on read mapping coordinates to detect SNVs and is therefore capable of reference-free and mapping-free SNV detection. GeDi executes with practical runtime and memory resource requirements, is capable of SNV detection at very low allele frequency (<1%), and detects SNVs with high sensitivity at complex variant loci, dramatically outperforming MuTect, a well-established pipeline. CONCLUSION: By designing novel suffix-array based SNV calling methods, we have developed a practical SNV calling software, GeDi, that can characterise SNVs at complex variant loci and at low allele frequency thus increasing the repertoire of detectable SNVs in tumour genomes. We expect GeDi to find use cases in targeted-deep sequencing analysis, and to serve as a replacement and improvement over previous suffix-array based SNV calling methods.

Assuntos

Variação Genética , Genoma , Neoplasias/genética , Software , Algoritmos , Frequência do Gene , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Sequenciamento Completo do Genoma

8.

Run-time Reconfigurable Acceleration for Genetic Programming Fitness Evaluation in Trading Strategies.

Funie, Andreea-Ingrid; Grigoras, Paul; Burovskiy, Pavel; Luk, Wayne; Salmon, Mark.

J Signal Process Syst ; 90(1): 39-52, 2018.

Artigo em Inglês | MEDLINE | ID: mdl-31998430

RESUMO

Genetic programming can be used to identify complex patterns in financial markets which may lead to more advanced trading strategies. However, the computationally intensive nature of genetic programming makes it difficult to apply to real world problems, particularly in real-time constrained scenarios. In this work we propose the use of Field Programmable Gate Array technology to accelerate the fitness evaluation step, one of the most computationally demanding operations in genetic programming. We propose to develop a fully-pipelined, mixed precision design using run-time reconfiguration to accelerate fitness evaluation. We show that run-time reconfiguration can reduce resource consumption by a factor of 2 compared to previous solutions on certain configurations. The proposed design is up to 22 times faster than an optimised, multithreaded software implementation while achieving comparable financial returns.

9.

Quantum Chemistry in Dataflow: Density-Fitting MP2.

Cooper, Bridgette; Girdlestone, Stephen; Burovskiy, Pavel; Gaydadjiev, Georgi; Averbukh, Vitali; Knowles, Peter J; Luk, Wayne.

J Chem Theory Comput ; 13(11): 5265-5272, 2017 Nov 14.

Artigo em Inglês | MEDLINE | ID: mdl-29019679

RESUMO

We demonstrate the use of dataflow technology in the computation of the correlation energy in molecules at the Møller-Plesset perturbation theory (MP2) level. Specifically, we benchmark density fitting (DF)-MP2 for as many as 168 atoms (in valinomycin) and show that speed-ups between 3 and 3.8 times can be achieved when compared to the MOLPRO package run on a single CPU. Acceleration is achieved by offloading the matrix multiplications steps in DF-MP2 to Dataflow Engines (DFEs). We project that the acceleration factor could be as much as 24 with the next generation of DFEs.

10.

Leveraging FPGAs for Accelerating Short Read Alignment.

Arram, James; Kaplan, Thomas; Luk, Wayne; Jiang, Peiyong.

IEEE/ACM Trans Comput Biol Bioinform ; 14(3): 668-677, 2017.

Artigo em Inglês | MEDLINE | ID: mdl-26955050

RESUMO

One of the key challenges facing genomics today is how to efficiently analyze the massive amounts of data produced by next-generation sequencing platforms. With general-purpose computing systems struggling to address this challenge, specialized processors such as the Field-Programmable Gate Array (FPGA) are receiving growing interest. The means by which to leverage this technology for accelerating genomic data analysis is however largely unexplored. In this paper, we present a runtime reconfigurable architecture for accelerating short read alignment using FPGAs. This architecture exploits the reconfigurability of FPGAs to allow the development of fast yet flexible alignment designs. We apply this architecture to develop an alignment design which supports exact and approximate alignment with up to two mismatches. Our design is based on the FM-index, with optimizations to improve the alignment performance. In particular, the n-step FM-index, index oversampling, a seed-and-compare stage, and bi-directional backtracking are included. Our design is implemented and evaluated on a 1U Maxeler MPC-X2000 dataflow node with eight Altera Stratix-V FPGAs. Measurements show that our design is 28 times faster than Bowtie2 running with 16 threads on dual Intel Xeon E5-2640 CPUs, and nine times faster than Soap3-dp running on an NVIDIA Tesla C2070 GPU.

Assuntos

Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Alinhamento de Sequência/métodos , Análise de Sequência de DNA/métodos , Algoritmos , Fatores de Tempo

11.

NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors.

Cheung, Kit; Schultz, Simon R; Luk, Wayne.

Front Neurosci ; 9: 516, 2015.

Artigo em Inglês | MEDLINE | ID: mdl-26834542

RESUMO

NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.

12.

On the use of programmable hardware and reduced numerical precision in earth-system modeling.

Düben, Peter D; Russell, Francis P; Niu, Xinyu; Luk, Wayne; Palmer, T N.

J Adv Model Earth Syst ; 7(3): 1393-1408, 2015 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-27642499

RESUMO

Programmable hardware, in particular Field Programmable Gate Arrays (FPGAs), promises a significant increase in computational performance for simulations in geophysical fluid dynamics compared with CPUs of similar power consumption. FPGAs allow adjusting the representation of floating-point numbers to specific application needs. We analyze the performance-precision trade-off on FPGA hardware for the two-scale Lorenz '95 model. We scale the size of this toy model to that of a high-performance computing application in order to make meaningful performance tests. We identify the minimal level of precision at which changes in model results are not significant compared with a maximal precision version of the model and find that this level is very similar for cases where the model is integrated for very short or long intervals. It is therefore a useful approach to investigate model errors due to rounding errors for very short simulations (e.g., 50 time steps) to obtain a range for the level of precision that can be used in expensive long-term simulations. We also show that an approach to reduce precision with increasing forecast time, when model errors are already accumulated, is very promising. We show that a speed-up of 1.9 times is possible in comparison to FPGA simulations in single precision if precision is reduced with no strong change in model error. The single-precision FPGA setup shows a speed-up of 2.8 times in comparison to our model implementation on two 6-core CPUs for large model setups.

13.

Dimensionality Reduction in Controlling Articulated Snake Robot for Endoscopy Under Dynamic Active Constraints.

Kwok, Ka-Wai; Tsoi, Kuen Hung; Vitiello, Valentina; Clark, James; Chow, Gary C T; Luk, Wayne; Yang, Guang-Zhong.

IEEE Trans Robot ; 29(1): 15-31, 2013 Feb 01.

Artigo em Inglês | MEDLINE | ID: mdl-24741371

RESUMO

This paper presents a real-time control framework for a snake robot with hyper-kinematic redundancy under dynamic active constraints for minimally invasive surgery. A proximity query (PQ) formulation is proposed to compute the deviation of the robot motion from predefined anatomical constraints. The proposed method is generic and can be applied to any snake robot represented as a set of control vertices. The proposed PQ formulation is implemented on a graphic processing unit, allowing for fast updates over 1 kHz. We also demonstrate that the robot joint space can be characterized into lower dimensional space for smooth articulation. A novel motion parameterization scheme in polar coordinates is proposed to describe the transition of motion, thus allowing for direct manual control of the robot using standard interface devices with limited degrees of freedom. Under the proposed framework, the correct alignment between the visual and motor axes is ensured, and haptic guidance is provided to prevent excessive force applied to the tissue by the robot body. A resistance force is further incorporated to enhance smooth pursuit movement matched to the dynamic response and actuation limit of the robot. To demonstrate the practical value of the proposed platform with enhanced ergonomic control, detailed quantitative performance evaluation was conducted on a group of subjects performing simulated intraluminal and intracavity endoscopic tasks.

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA